ABSTRACT

In this paper, we tackle the problem of constructing a differentially private synopsis for classification analyses. Several state-of-the-art methods follow the structure of existing classification algorithms and are all iterative, which is suboptimal due to locally optimal choices and a privacy budget that is over-divided among many sequentially composed steps. Instead, we propose PrivPfC, a new differentially private method for releasing data for classification. The key idea is to privately select an optimal partition of the underlying dataset using the given privacy budget in one step. Given a dataset and a privacy budget, PrivPfC constructs a pool of candidate grids in which the number of cells of each grid is under a data-aware and privacy-budget-aware threshold. PrivPfC then selects an optimal grid via the exponential mechanism, using a novel quality function that minimizes the expected number of misclassified records when a histogram classifier is constructed from the published grid. Finally, PrivPfC injects noise into each cell of the selected grid and releases the noisy grid as the private synopsis of the data. If the size of the candidate grid pool is larger than the processing capability threshold set by the data curator, we add a step at the beginning of PrivPfC to prune the set of attributes privately. We introduce a modified $\chi^2$ quality function with low sensitivity and use it to evaluate an attribute’s relevance to the classification label variable. Through extensive experiments on real datasets, we demonstrate PrivPfC’s superiority over the state-of-the-art methods.

1. INTRODUCTION 

We study the problem of publishing histograms of datasets while 
satisfying differential privacy. A histogram is an important tool 
for summarizing data, and can serve as the basis for many data 
analysis tasks. Publishing noisy histograms for one-dimensional or
two-dimensional datasets has been studied extensively in recent
years (2l]|4l][ll|2ilEl|40l|39l|9l|33l|32). However, as noticed 
in [39, 33], these approaches do not work well when the number of
attributes/dimensions goes above a few. Many datasets that are of
interest have multiple attributes. In this paper, we focus on
multi-attribute datasets that have dozens of attributes, some categorical
and some numerical.

For such a multi-attribute dataset, it is infeasible to publish a histogram with all the attributes. It is therefore necessary to select a subset of the attributes that are “interesting” for some intended data analysis tasks, and to determine how to discretize the attributes. These selections partition the domain into a number of cells. We call the result a “grid”. We consider a common optimization objective, where the dataset includes a label attribute and our goal is to ensure that classifiers that are accurate for the original dataset can be learnt from the published noisy histograms. Classification is an important tool for data analysis, and learning classifiers in a differentially private way has been considered an important problem, with many recent attempts.

In this paper we propose the PrivPfC (Private Publication for Classification) approach for publishing projected histograms. The key novelty is to privately select a high-quality grid in a single step, while adapting to the privacy budget $\epsilon$. We construct a set of candidate grids where the number of cells is under a certain threshold (determined by the dataset size and the privacy budget), and then use the exponential mechanism to select one grid using a novel quality function that minimizes the expected number of misclassified records when a histogram classifier is constructed using the published histogram. By construction, our quality function accounts for the impact on classification accuracy of the Laplace noise injected into the histogram.

For high-dimensional datasets, the size of the set of candidate grids might be larger than the processing capacity of the data curator. In that case, we add a feature selection step at the beginning to prune the set of attributes. This step privately selects a small number of attributes that are most relevant to the class attribute by employing the exponential mechanism. We introduce a modified $\chi^2$ correlation function that has low sensitivity while evaluating an attribute’s relevance to the classification label variable. This feature selection step enables our PrivPfC framework to scale to higher-dimensional datasets.

Our proposed PrivPfC outputs a histogram that can be used to generate synthetic data for multiple data analysis tasks, while being optimized for data classification. We show the effectiveness of PrivPfC by comparing it with several other approaches that output a classifier in a differentially private fashion.

For evaluation, we use two common classification algorithms, the decision tree and the SVM, because these have been used in the literature on learning classifiers while satisfying differential privacy. Extensive experiments on real datasets show that PrivPfC consistently and significantly outperforms other state-of-the-art methods.

The contributions of this paper are summarized as follows: 


1. We propose PrivPfC, a novel framework for publishing data for classification under differential privacy. As part of PrivPfC, we introduce a new quality function that enables the selection of a good “grid” for publishing noisy histograms. We also introduce a way to privately select the features most relevant for classification, which enables PrivPfC to scale to higher-dimensional datasets.

2. Through extensive experiments on real datasets, we have compared PrivPfC against several other state-of-the-art methods for data publishing as well as private classification, demonstrating that PrivPfC improves the state of the art.

The rest of the paper is organized as follows. In Section 2, we review the related work. Our PrivPfC approach is presented in Section 3. We report experimental results in Section 4. Section 5 concludes our work.

2. RELATED WORK 

The notion of differential privacy was developed in a series of papers. There are several primitives for satisfying $\epsilon$-differential privacy. In this paper we use two of them. The first primitive is the Laplace mechanism. It adds noise sampled from a Laplace distribution to a statistic $f$ to be released. The scale of the Laplace distribution is proportional to $GS_f$, the global sensitivity or the $L_1$ sensitivity of $f$. Another primitive is to sample the output of the data analysis mechanism according to an exponential distribution; this is generally referred to as the exponential mechanism [28]. The mechanism relies on a quality function $q : \mathcal{D} \times \mathcal{R} \rightarrow \mathbb{R}$ that assigns a real-valued score to one output $r \in \mathcal{R}$ when the input dataset is $D$, where higher scores indicate more desirable outputs. Given the quality function $q$, its global sensitivity $GS_q$ is defined as:

$$GS_q = \max_{r} \max_{D \simeq D'} |q(D, r) - q(D', r)|.$$

The following method satisfies $\epsilon$-differential privacy:

$$\Pr[r \text{ is selected}] \propto \exp\left(\frac{\epsilon}{2\, GS_q}\, q(D, r)\right).$$
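
As a rough illustration of the second primitive, the following Python sketch samples an output with probability proportional to $\exp(\epsilon\, q(D, r) / (2\, GS_q))$, the standard form of the exponential mechanism. The function names and the toy scores in the usage are illustrative assumptions, not part of PrivPfC.

```python
import math
import random


def selection_probabilities(scores, epsilon, sensitivity):
    """Return the exponential-mechanism probabilities for the given scores."""
    top = max(scores)  # shifting by the maximum leaves the ratios unchanged
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def exponential_mechanism(candidates, quality, epsilon, sensitivity, rng=random):
    """Sample one candidate; higher quality scores are more likely."""
    scores = [quality(c) for c in candidates]
    probs = selection_probabilities(scores, epsilon, sensitivity)
    return rng.choices(candidates, weights=probs, k=1)[0]
```

Subtracting the maximum score before exponentiating does not change the selection probabilities and avoids overflow when scores are large.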

There has been a large body of work on differentially private histogram construction for answering range queries or marginal queries.

Differentially Private Classification. Differentially private classification has received growing attention in the research community. Blum et al. suggested a solution for constructing a private version of the ID3 [35] decision tree classifier. When the ID3 algorithm needs the number of tuples with a specific feature value, it queries the SuLQ interface to get the corresponding noisy count. Friedman and Schuster [17] improved this approach by redesigning the classic ID3 classifier construction algorithm to use feature quality functions with low sensitivity and by using the exponential mechanism to evaluate all the attributes simultaneously. Chaudhuri et al. proposed a differentially private logistic regression algorithm and later generalized this idea to private empirical risk minimization, which can be applied to a wider range of classification problems, such as SVM classification. Zhang et al. [46] proposed PrivGene, a general private model fitting framework based on genetic algorithms, which can be applied to SVM classification and logistic regression.

Besides the above interactive methods for constructing differentially private classifiers, several works proposed solutions to publish data for classification analysis tasks. Mohammed et al. [30] proposed the DiffGen algorithm, which first partitions the data domain by iteratively selecting attributes and ways to discretize the attributes, and then injects Laplace noise into each cell of all the leaf partitions. Vinterbo [39] proposed another data publishing algorithm, called Private Projected Histogram (PPH). PPH first decides how many attributes are to be selected, then incrementally selects attributes via the exponential mechanism to maximize the discernibility of the selected attributes. For each categorical attribute, the full domain is used. For each numerical attribute, it uses the formula proposed in Lei [26] to decide how many bins to discretize it into. In this method, the number of attributes and how attributes are partitioned are independent of the privacy budget. Furthermore, all selected attributes are treated equally. Zhang et al. [45] presented PrivBayes, which constructs a private Bayesian network by iteratively selecting sets of attributes that have maximum mutual information via the exponential mechanism. It then injects Laplace noise to perturb each conditional distribution of the network. We will further analyze the above approaches and compare our proposed method with them in the later sections.

3. PrivPfC FRAMEWORK 

In this section we present the PrivPfC framework of privately 
publishing data for classification analysis. 

3.1 Preliminaries 

We consider a dataset with a set of predictor variables and one 
binary response variable. The predictor variables can be numerical 
or categorical. Following (2l[22l|T8l[30), for each predictor variable 
Ai, we assume the existence of a taxonomy hierarchy (also called 
a generalization hierarchy in the literature) Ti. Figure 1 shows
the taxonomy hierarchies of Relationship, a categorical variable, 
and Education-num, a numerical variable. In the hierarchy, the root 
node represents the whole domain of the variable, and a parent node 
is a generalization (or a cover) of its children. Child nodes under 
the same parent node are semantically related; they are closer to 
each other than to nodes under a different parent node. 

Each level of a predictor variable’s taxonomy hierarchy forms a 
partition of its domain. On the basis of the taxonomy hierarchy and 
its levels, we introduce the notion of a grid. 

Definition 1 (Grid). Let $\mathcal{A} = \{A_1, \ldots, A_d\}$ be the set of predictor variables in a dataset and $\{T_1, \ldots, T_d\}$ be their taxonomy hierarchies respectively. Let $h_i$ be the height of $T_i$, $1 \le i \le d$. Then, a grid $g$ is given by $(\ell_1, \ldots, \ell_d)$, where $1 \le \ell_i \le h_i$ and $1 \le i \le d$. A grid defines a partition of the data domain into cells where each attribute $A_i$ is partitioned into the values at level $\ell_i$. The number of cells of a grid is $\prod_{i=1}^{d} |T_i[\ell_i]|$, where $|T_i[\ell_i]|$ is the number of nodes in level $\ell_i$ of the hierarchy $T_i$. The number of all possible grids is $\prod_{i=1}^{d} h_i$.
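
To make the grid definition concrete, the sketch below counts the cells of a grid and enumerates the grids whose cell count stays within a limit. It assumes each hierarchy is summarized by a hypothetical list `level_sizes[i]`, whose entry at position `l - 1` is the number of nodes at level `l` of that hierarchy. The helper names and the inclusive limit are assumptions made for illustration.

```python
import itertools
import math


def cell_count(level_sizes, grid):
    """Number of cells of a grid: product of the node counts at the chosen levels."""
    return math.prod(sizes[level - 1] for sizes, level in zip(level_sizes, grid))


def enumerate_grids(level_sizes, max_cells):
    """Yield every grid (l_1, ..., l_d) whose cell count is at most max_cells."""
    level_ranges = [range(1, len(sizes) + 1) for sizes in level_sizes]
    for grid in itertools.product(*level_ranges):
        if cell_count(level_sizes, grid) <= max_cells:
            yield grid
```

For two attributes with level sizes `[1, 3, 6]` and `[1, 2]`, there are 3 x 2 = 6 possible grids, and the finest one, `(3, 2)`, has 12 cells.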

Definition 2 (Histogram). Given a dataset D and a grid 
g, a histogram H(D, g) partitions D into cells according to g, and 
outputs the numbers of positive instances and negative instances in 
each cell. 

By injecting Laplace noise into the positive counts and negative counts of each cell in the histogram $H(D, g)$, we get its noisy version, $\tilde{H}(D, g)$.
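
A minimal sketch of building and perturbing such a histogram follows. The noise scale $1/\epsilon$ and the rounding to non-negative integers mirror the perturbHist routine of Algorithm 1; the function names, the `cell_of` mapping, and the explicit `cells` list (so that empty cells also receive noise) are assumptions made for illustration.

```python
import random


def build_histogram(records, cell_of, cells):
    """Count (positive, negative) records in every cell of the grid.

    records: iterable of (features, label) pairs with label in {0, 1}.
    cell_of: function mapping features to a cell identifier.
    cells:   all cell identifiers of the grid, so empty cells are kept.
    """
    counts = {cell: [0, 0] for cell in cells}
    for features, label in records:
        counts[cell_of(features)][0 if label == 1 else 1] += 1
    return {cell: (pos, neg) for cell, (pos, neg) in counts.items()}


def laplace(scale, rng=random):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_histogram(hist, epsilon, rng=random):
    """Add Laplace(1/epsilon) noise to both counts of each cell, then round
    each noisy count to the nearest non-negative integer."""
    noisy = {}
    for cell, (pos, neg) in hist.items():
        noisy_pos = max(0, round(pos + laplace(1.0 / epsilon, rng)))
        noisy_neg = max(0, round(neg + laplace(1.0 / epsilon, rng)))
        noisy[cell] = (noisy_pos, noisy_neg)
    return noisy
```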

3.2 Histogram Publishing for Classification 

Given a dataset $D$, the taxonomy hierarchies of its predictor variables, a total privacy budget $\epsilon$, and the number of tuples in the dataset $N$ (a rough estimate suffices), we generate a candidate pool of all grids whose number of cells is below a threshold, which is determined by $\epsilon$ and $N$. We compute a quality score for each grid in the pool, which measures the usefulness of that grid for classification. We then apply the exponential mechanism [28] to privately select a grid $g$, and finally publish a noisy histogram using $g$.

Figure 1: Taxonomy hierarchies of the Relationship attribute and the Education-num attribute

A key technical challenge is to come up with a low-sensitivity quality function that can measure the desirability of choosing a particular grid $g$. We publish $\tilde{H}(D, g)$, a noisy histogram of $D$ using $g$ to partition the data domain, and desire that classifiers learned from $\tilde{H}(D, g)$ are close to classifiers learned from $D$. Furthermore, we desire this to hold regardless of which particular classification algorithm is used. We propose to define the quality function to minimize the misclassification error (when measured using the dataset $D$) for the classifier defined by the histogram $\tilde{H}(D, g)$; i.e., for each cell in the grid defined by $g$, this classifier predicts the majority class according to $\tilde{H}(D, g)$. This classifier is in the same spirit as histogram classifiers, and we use $HC^{\tilde{H}(D,g)}$ to denote it.
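
The histogram classifier described above can be sketched in a few lines. The representation of a histogram as a dictionary from cell to (positive, negative) counts, the tie-breaking toward the positive class, and the default label for unseen cells are all assumptions made for illustration.

```python
def histogram_classifier(noisy_hist, default=0):
    """Build a classifier that predicts, for each cell, the majority class
    according to the noisy counts (ties go to the positive class)."""
    labels = {cell: 1 if pos >= neg else 0 for cell, (pos, neg) in noisy_hist.items()}

    def predict(cell):
        return labels.get(cell, default)

    return predict


def misclassified(predict, true_hist):
    """Number of records of the true histogram that the classifier gets wrong."""
    errors = 0
    for cell, (pos, neg) in true_hist.items():
        # Predicting positive misclassifies the negatives, and vice versa.
        errors += neg if predict(cell) == 1 else pos
    return errors
```

Measuring the error against the true counts, while predicting from the noisy counts, is what makes the error a random variable once the grid is fixed.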

Suppose that a grid is able to separate positive and negative data points very well. Then, even after noise is added, this separation is still preserved and can be used to learn classifiers. When no noise is added, the finest partition is desired. However, with noise, we want to ensure that the noise does not overwhelm the true counts. For a fixed grid $g$, the noisy histogram includes random noise, so the misclassification error is a random variable; we use the expected value of this error as the quality function.

Definition 3 (Quality of grid). Given a dataset $D$ and a grid $g$, the grid quality is measured by the expected misclassification error of the histogram classifier $HC^{\tilde{H}(D,g)}$:

$$\mathrm{qual}_D(g) = \mathbb{E}\left[\mathrm{err}\left(HC^{\tilde{H}(D,g)}, D\right)\right].$$

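The following lemma shows how to compute $\mathrm{qual}_D(g)$.

Lemma 1 (Quality of grid). Given a dataset $D$ and a grid $g$, with $\epsilon$ as the parameter for adding Laplacian noise to the counts, we have

$$\mathrm{qual}_D(g) = \sum_{c \in g} \left[ \min(n_c^+, n_c^-) \left(1 - \frac{1}{2} e^{-\epsilon x_c}\left(1 + \frac{\epsilon x_c}{2}\right)\right) + \max(n_c^+, n_c^-) \cdot \frac{1}{2} e^{-\epsilon x_c}\left(1 + \frac{\epsilon x_c}{2}\right) \right], \quad (1)$$

where $c$ ranges over all cells in the grid, $n_c^+$ is the number of positive data points in $c$, $n_c^-$ is the number of negative data points in $c$, and $x_c = |n_c^+ - n_c^-|$.

To prove Lemma 1, we note that $\mathrm{qual}_D(g)$ can be further decomposed into the sum of the expected misclassification errors at each perturbed cell of the histogram after majority voting, and thus

$$\mathrm{qual}_D(g) = \sum_{c \in g} \mathbb{E}[\mathrm{err}(c, D)],$$

where $\mathrm{err}(c, D)$ denotes the error from applying the histogram classifier to the cell $c$.

For cell $c$, if the added Laplace noise does not change the majority class label, then the number of misclassified input tuples is $\min(n_c^+, n_c^-)$; otherwise, it is $\max(n_c^+, n_c^-)$. Thus,

$$\mathbb{E}[\mathrm{err}(c, D)] = \min(n_c^+, n_c^-) \cdot p_c + \max(n_c^+, n_c^-) \cdot (1 - p_c), \quad (2)$$

where $p_c$ is the probability that the majority class label in $c$ does not change after injecting Laplace noise.

Let $Z_1$ and $Z_2$ be the Laplace noises added to the majority class and the minority class of cell $c$, respectively. Then

$$p_c = \Pr\left[Z_2 - Z_1 < |n_c^+ - n_c^-|\right]. \quad (3)$$

Lemma 2. Let $Z_1$ and $Z_2$ be two i.i.d. random variables that follow the Laplace distribution with mean 0 and scale $1/\epsilon$. Then the density of their difference $Y = Z_1 - Z_2$ is

$$f_Y(y) = \frac{\epsilon}{4} e^{-\epsilon |y|} (1 + \epsilon |y|), \quad -\infty < y < \infty,$$

and the corresponding cumulative distribution function is

$$F_Y(y) = \begin{cases} 1 - \frac{1}{2} e^{-\epsilon y}\left(1 + \frac{\epsilon y}{2}\right), & \text{if } y \ge 0, \\ \frac{1}{2} e^{\epsilon y}\left(1 - \frac{\epsilon y}{2}\right), & \text{otherwise.} \end{cases} \quad (4)$$

From Equations (3) and (4), we have

$$p_c = 1 - \frac{1}{2} e^{-\epsilon |n_c^+ - n_c^-|}\left(1 + \frac{\epsilon |n_c^+ - n_c^-|}{2}\right). \quad (5)$$

Plugging Equation (5) into Equation (2) proves Lemma 1.

The lemma below bounds the sensitivity of our quality function.

Lemma 3. For any $\epsilon > 0$, the global sensitivity of the quality function (1) is $B(\epsilon)$, where

$$B(\epsilon) = x \cdot \left[ \frac{1}{2} e^{-\epsilon (x-1)}\left(1 + \frac{\epsilon (x-1)}{2}\right) - \frac{1}{2} e^{-\epsilon x}\left(1 + \frac{\epsilon x}{2}\right) \right] + 1 - \frac{1}{2} e^{-\epsilon (x-1)}\left(1 + \frac{\epsilon (x-1)}{2}\right)$$

and

$$x = \frac{\epsilon e^{\epsilon} + \sqrt{2 - (4 - \epsilon^2) e^{\epsilon} + 2 e^{2\epsilon}}}{-\epsilon + \epsilon e^{\epsilon}}.$$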
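
The per-cell expectation can be evaluated in closed form. The sketch below assumes that independent Laplace noise of scale $1/\epsilon$ is added to each count, so that the majority label of a cell whose two counts differ by $x_c$ flips with probability $\frac{1}{2} e^{-\epsilon x_c}(1 + \epsilon x_c / 2)$. Under the same assumption it also evaluates the sensitivity bound $B(\epsilon)$. Function names are illustrative.

```python
import math


def flip_probability(x, epsilon):
    """Probability that the noise flips the majority label of a cell whose
    positive and negative counts differ by x."""
    return 0.5 * math.exp(-epsilon * x) * (1.0 + epsilon * x / 2.0)


def expected_cell_error(n_pos, n_neg, epsilon):
    """Expected number of misclassified records in one cell."""
    p_keep = 1.0 - flip_probability(abs(n_pos - n_neg), epsilon)
    return min(n_pos, n_neg) * p_keep + max(n_pos, n_neg) * (1.0 - p_keep)


def grid_quality(cells, epsilon):
    """Expected misclassification error of a grid given its (pos, neg) counts."""
    return sum(expected_cell_error(pos, neg, epsilon) for pos, neg in cells)


def sensitivity_bound(epsilon):
    """Evaluate B(epsilon): the largest change of the expected error when one
    record is added to the minority class of a cell with count gap x."""
    e = math.exp(epsilon)
    root = math.sqrt(2.0 - (4.0 - epsilon ** 2) * e + 2.0 * e * e)
    x = (epsilon * e + root) / (epsilon * (e - 1.0))
    q_prev = flip_probability(x - 1.0, epsilon)
    return x * (q_prev - flip_probability(x, epsilon)) + 1.0 - q_prev
```

A cell with equal counts contributes half of its records in expectation, because the noise decides the label by a fair coin; a cell with a large gap contributes close to its minority count.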


3.3 Correlation-based Feature Selection 

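Our basic solution privately selects a grid from the candidate pool to release synthetic data. As the number of predictor variables increases, the candidate pool size grows exponentially, giving rise to a scalability issue. Fortunately, in a real dataset some predictor variables are usually not useful for predicting the class labels. Such irrelevant variables can thus be excluded from the classification to improve the scalability of our solution.

Algorithm 1 PrivPfC: Privately Publishing Data for Classification
Input: dataset $D$, the set of predictor variables $\mathcal{A}$ and their taxonomy hierarchies, total privacy budget $\epsilon$, maximum grid pool size $B$, median of the first branching factors of hierarchies $b$.
 1: function main($D$, $\mathcal{A}$, $N$, $\epsilon$, $B$, $b$)
 2:   $T \leftarrow$ maximum number of cells in a grid, determined by $N$ and $\epsilon$
 3:   $H \leftarrow$ Enumerate($\mathcal{A}$, $T$)
 4:   if $|H| > B$ then
 5:     $\epsilon_{fs} \leftarrow 0.3\epsilon$, $\epsilon_{sh} \leftarrow 0.3\epsilon$, $\epsilon_{ph} \leftarrow 0.4\epsilon$
 6:     $k \leftarrow$ number of attributes to select, determined by $T$ and $b$
 7:     $X \leftarrow$ selectFeature($D$, $\mathcal{A}$, $k$, $\epsilon_{fs}$)
 8:     $H_X \leftarrow$ Enumerate($X$, $T$)
 9:     $I \leftarrow$ PrivateHistogramPublishing($D$, $H_X$, $\epsilon_{sh}$, $\epsilon_{ph}$)
10:   else
11:     $\epsilon_{sh} \leftarrow \frac{3}{7}\epsilon$, $\epsilon_{ph} \leftarrow \frac{4}{7}\epsilon$
12:     $I \leftarrow$ PrivateHistogramPublishing($D$, $H$, $\epsilon_{sh}$, $\epsilon_{ph}$)
13:   end if
14:   return $I$
15: end function

16: function PrivateHistogramPublishing($D$, $H$, $\epsilon_{sh}$, $\epsilon_{ph}$)
17:   $h \leftarrow$ selectHist($D$, $H$, $\epsilon_{sh}$)
18:   $I \leftarrow$ perturbHist($D$, $h$, $\epsilon_{ph}$)
19:   return $I$
20: end function

21: function selectHist($D$, $H$, $\epsilon_{sh}$)
22:   for $i = 1 \rightarrow |H|$ do
23:     $q_i \leftarrow \mathrm{qual}(H_i)$
24:     $p_i \leftarrow$ exponential-mechanism weight of $H_i$ given $q_i$ and $\epsilon_{sh}$
25:   end for
26:   $h \leftarrow$ sample $i \in [1..|H|]$ according to $p_i$
27:   return $h$
28: end function

29: function perturbHist($D$, $h$, $\epsilon_{ph}$)
30:   Initialize $I$ to empty
31:   for each cell $c \in h$ do
32:     $\tilde{n}_c^+ \leftarrow n_c^+ + \mathrm{Lap}(1/\epsilon_{ph})$
33:     $\tilde{n}_c^- \leftarrow n_c^- + \mathrm{Lap}(1/\epsilon_{ph})$
34:     Add $(\tilde{n}_c^+, \tilde{n}_c^-)$ to $I$
35:   end for
36:   Round all counts of $I$ to their nearest non-negative integers.
37:   return $I$
38: end function

39: function selectFeature($D$, $\mathcal{A}$, $k$, $\epsilon_{fs}$)
40:   Initialize $X$ to empty
41:   Let $R$ be the response variable in $D$
42:   for each $A_i \in \mathcal{A}$ do
43:     $cor_i \leftarrow$ Cor($A_i$, $R$, $D$)
44:     $p_i \leftarrow$ exponential-mechanism weight of $A_i$ given $cor_i$, $\epsilon_{fs}$ and $k$
45:   end for
46:   for $i = 1 \rightarrow k$ do
47:     $f \leftarrow$ sample $A_i \in \mathcal{A}$ according to $p_i$
48:     Add $f$ to $X$
49:     Remove $f$ from $\mathcal{A}$
50:   end for
51:   return $X$
52: end function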

Feature selection is the process of selecting a subset of important features (predictor variables) to build a classification model. Various feature selection methods have been proposed, including wrapper methods, embedded methods, and stepwise regression. However, they require building a large number of classification models, one for each subset of features one wants to evaluate. It is unclear how to adapt these methods to satisfy differential privacy. We propose a simple but effective approach, which selects predictor variables based on a correlation analysis between predictor variables and the class (i.e., target variable). We adapt the $\chi^2$ correlation test [20] to have a low sensitivity.

Given a dataset $D$ with $N$ tuples, the $\chi^2$ correlation test [20] evaluates whether categorical variables $A$ and $B$ are correlated. Suppose that variable $A$ has $m$ distinct values, $a_1, \ldots, a_m$, and $B$ has $n$ distinct values, $b_1, \ldots, b_n$. The $\chi^2$ correlation between $A$ and $B$ (a.k.a. the Pearson $\chi^2$ statistic) is defined as

$$\chi^2(A, B) = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \quad (6)$$

where $o_{ij}$ is the observed number of tuples with $A = a_i$ and $B = b_j$ in dataset $D$, and the expected count $e_{ij}$ is computed by assuming $A$ and $B$ are independent:

$$e_{ij} = \frac{\mathrm{count}(A = a_i) \times \mathrm{count}(B = b_j)}{N}, \quad (7)$$

where $\mathrm{count}(A = a_i)$ returns the number of tuples in dataset $D$ with $A = a_i$.

Clearly, if $A$ and $B$ are independent, then $\chi^2(A, B) = 0$. The bigger the $\chi^2$ value is, the stronger the correlation between variables $A$ and $B$ is. $\chi^2(A, B)$, however, has a large global sensitivity, because the $e_{ij}$'s, which appear in the denominator of the statistic, can be very small. Our analysis (omitted for space limitation) gives a lower bound on this sensitivity.

We adapt the $\chi^2$ correlation test to define the correlation $\mathrm{Cor}(A, R)$ between a predictor variable $A$ with $m$ distinct values and the binary response variable $R$ as:

$$\mathrm{Cor}(A, R) = \sum_{i=1}^{m} \sum_{j=1}^{2} |o_{ij} - e_{ij}|, \quad (8)$$

where $o_{ij}$ is the observed number of tuples with $A = a_i$ and $R = r_j$ in $D$, and the expected count $e_{ij}$ is computed by assuming $A$ and $R$ are independent, as in Equation (7).

Lemma 4. Let $A$ and $R$ be a categorical variable and the binary response variable in a dataset $D$, respectively. Then, the global sensitivity of the function $\mathrm{Cor}(A, R, D)$ is 2.

The proof is in the Appendix Section. 
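
The two correlation measures are easy to compare in code. The sketch below computes both the Pearson $\chi^2$ statistic and the absolute-deviation variant Cor from a list of (attribute value, label) pairs; the input format and function names are assumptions made for illustration.

```python
from collections import Counter


def _contingency(pairs):
    """Observed joint counts and the two marginal counts."""
    observed = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_r = Counter(r for _, r in pairs)
    return observed, count_a, count_r


def chi_square(pairs):
    """Pearson chi-square statistic: sum of (o - e)^2 / e over all value pairs."""
    n = len(pairs)
    observed, count_a, count_r = _contingency(pairs)
    total = 0.0
    for a in count_a:
        for r in count_r:
            expected = count_a[a] * count_r[r] / n
            total += (observed[(a, r)] - expected) ** 2 / expected
    return total


def cor(pairs):
    """Low-sensitivity variant: sum of |o - e| over all value pairs."""
    n = len(pairs)
    observed, count_a, count_r = _contingency(pairs)
    total = 0.0
    for a in count_a:
        for r in count_r:
            expected = count_a[a] * count_r[r] / n
            total += abs(observed[(a, r)] - expected)
    return total
```

Dropping the division by the expected count is what removes the dependence on very small $e_{ij}$ values, at the price of a score that is no longer normalized.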

3.4 The Algorithm 

We now present the full algorithm (Algorithm 1) for our framework of releasing private data for classification tasks.

Line 2 sets the threshold on the maximum number of cells in a grid, to prevent the average counts from being dominated by the injected noise. The threshold is chosen so that the average noise magnitude is no more than 20% of the average cell count.
When feature selection is deemed necessary, we allocate 30% of the privacy budget to privately select $k$ predictor variables that are strongly correlated with the response variable. These selected variables are then used to release private synthetic data.
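
One possible way to realize this selection step is sketched below. It draws $k$ attributes one at a time, without replacement, using exponential-mechanism weights on precomputed Cor scores. Two points are assumptions of this sketch rather than details taken from the algorithm: the feature selection budget is split evenly over the $k$ draws, and the sensitivity of 2 from Lemma 4 is used in the exponent. The function and parameter names are hypothetical.

```python
import math
import random


def select_features(cor_scores, k, epsilon_fs, sensitivity=2.0, rng=random):
    """Privately pick up to k attribute names from a {name: Cor score} dict."""
    remaining = dict(cor_scores)
    per_round = epsilon_fs / k  # assumption: even split over the k draws
    selected = []
    for _ in range(min(k, len(remaining))):
        names = list(remaining)
        top = max(remaining.values())  # shift scores for numerical stability
        weights = [
            math.exp(per_round * (remaining[name] - top) / (2.0 * sensitivity))
            for name in names
        ]
        pick = rng.choices(names, weights=weights, k=1)[0]
        selected.append(pick)
        del remaining[pick]  # sample without replacement
    return selected
```

With a very large budget the draw becomes effectively deterministic and returns the top-scoring attributes in order; with a small budget every attribute has a comparable chance.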

The number of attributes to be selected, $k$, is based on $T$, the maximum grid size. We want to have enough attributes so that relevant attributes are included. At the same time, we do not want $k$ to be so large that there are too many candidate grids with size below $T$. Let $b$ be the median of the first branching factors of the hierarchies of all attributes. We set $k$ from $T$ and $b$, with $\log(b)$ as the denominator of the expression.

Theorem 1. Algorithm 1 satisfies $\epsilon$-differential privacy.

The proof of Theorem 1 is straightforward by considering the sequential composability of differential privacy as discussed in Section 2: the budgets spent on feature selection, grid selection, and perturbation sum to $\epsilon$ in both branches of the algorithm.
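
The budget accounting behind this argument can be checked with a small sketch. The split follows the proportions stated for the algorithm (30%-30%-40% with feature selection, 3:4 without); the function name and dictionary keys are hypothetical.

```python
def allocate_budget(epsilon, needs_feature_selection):
    """Split the total budget among feature selection (fs), grid selection
    (sh), and histogram perturbation (ph)."""
    if needs_feature_selection:
        return {"fs": 0.3 * epsilon, "sh": 0.3 * epsilon, "ph": 0.4 * epsilon}
    return {"fs": 0.0, "sh": 3.0 * epsilon / 7.0, "ph": 4.0 * epsilon / 7.0}
```

In both branches the parts add up to the total budget, which is what sequential composition requires.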

4. EXPERIMENT 

4.1 Experimental Settings 

Datasets. We use 4 real datasets for our experiments. The first one is the Adult dataset from the UCI machine learning repository [1]. It contains 6 numerical attributes and 8 categorical attributes, and is widely used for evaluating the performance of classification algorithms. After removing missing values, the dataset contains 45,222 tuples. The second dataset is the Bank marketing dataset from the same repository. It contains 10 numerical attributes and 10 categorical attributes on 41,188 individuals. The third is the US dataset from the Integrated Public Use Microdata Series (IPUMS) [36]. It has 39,186 United States census records from 2010, with 15 numerical attributes and 31 categorical ones. The last is the BR dataset (also from IPUMS), which contains 57,333 Brazil census records from 2010 and has 14 numerical attributes and 28 categorical ones. The classification tasks for the Adult, US and BR datasets are to predict whether an individual has an income above a certain threshold. The one for the Bank dataset is to predict whether a client will subscribe to a term deposit. Table 1 summarizes the characteristics of the datasets.

Taxonomy Hierarchies. For the Adult dataset, we use the same taxonomy hierarchies as DiffGen [30]. For the remaining 3 datasets, we do the following. For numerical attributes, we partition each domain into equal-size bins and build hierarchies over them. For categorical attributes, we build taxonomy hierarchies by considering the semantic meanings of the attribute values.

Competing Methods. We compare PrivPfC with 6 state-of-the-art methods in terms of misclassification rate. These include 3 non-interactive methods, DiffGen [30], PrivBayes [45] and Private Projected Histogram (PPH) [39], which privately release synthetic datasets for classification analyses, and 3 interactive methods, PrivGene [46], DiffPC-4.5 [17], and PrivateERM [7], which include one method for decision trees and two methods for SVM.

DiffGen [30] consists of two steps, partition and perturbation. The partition step first generalizes all attributes’ values into the topmost nodes in their taxonomy hierarchies and then iteratively selects one attribute at a time for specialization, using the exponential mechanism. The quality of each candidate specialization is based on the same heuristics as used by decision tree algorithms, such as information gain and majority class. As suggested in [30], we use the majority class to measure the candidate quality, and set the number of specialization steps to be 10 for the Adult dataset and the Bank dataset. For the US and BR datasets, we set the number to be 6 and 8 respectively, as beyond these numbers, the DiffGen implementation runs into memory problems. The perturbation step injects Laplace noise into each cell of the partition and outputs all the cells with their noisy counts as the noisy synopsis of the data.

PrivBayes [45] determines the structure of a Bayesian network by first randomly selecting an attribute as the first node, and then iteratively selecting one attribute and up to $k$ nodes as the attribute’s parent nodes, which have the maximum mutual information. After the structure is determined, PrivBayes perturbs the marginals needed for computing the conditional distributions. The performance of the PrivBayes algorithm depends on $k$. We set $k = 3$ for the Adult dataset and the Bank dataset, which is the same as the value used in [45]. For the US and BR datasets, which were not used in [45], setting $k = 3$ runs out of memory in our experiments because of the larger dimensionality; we set $k = 2$ for them.

PPH [39] starts with a feature selection procedure to select a set of $k$ features that have the maximal discernibility. Then, it uses the selected features to build a histogram. For each categorical attribute, the full domain is used. For each numerical attribute, it uses the formula proposed in Lei [26] to decide how many bins to discretize it into.

PrivGene [46] is a general-purpose private model fitting framework based on genetic algorithms, which can be applied to SVM classification. DiffPC-4.5 [17] is an interactive private algorithm for building a C4.5 decision tree classifier in a differentially private way. PrivateERM [7] is an interactive private algorithm for constructing an SVM classifier by first injecting noise into the risk function and then optimizing the perturbed risk function.

The source codes of the DiffGen, PrivBayes, PPH, DiffPC-4.5, 
PrivGene were downloaded from 1291 . ES), E3. HI) and (44) , 
respectively. The source code of PrivateERM was shared by the
authors of PrivBayes [45].

Evaluation Methodology. We consider two baselines, Majority and NoiseFree. Majority is the misclassification rate of majority voting on the class attribute, which predicts each test case with the majority class label in the training dataset. NoiseFree is the misclassification rate of a decision tree or SVM classifier built on the true data. We expect a good algorithm to perform better than Majority and to get close to NoiseFree as $\epsilon$ increases.
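
For reference, the Majority baseline amounts to the following computation. This is a minimal sketch with a hypothetical function name; labels can be any hashable values.

```python
from collections import Counter


def majority_error(train_labels, test_labels):
    """Misclassification rate of always predicting the training majority label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    wrong = sum(1 for label in test_labels if label != majority_label)
    return wrong / len(test_labels)
```

Any method whose misclassification rate is above this value is doing worse than ignoring the predictor variables altogether.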

The evaluation is based on two classification models: the CART decision tree classifier and the SVM classifier with a radial basis kernel. The interactive approaches DiffPC-4.5 and PrivateERM build private classifiers directly, and we use the parameters suggested by the corresponding papers. The non-interactive approaches PrivPfC, PPH, DiffGen, and PrivBayes generate private synthetic datasets. To evaluate their performance in terms of the decision tree model, we use the rpart [37] library to build decision trees on their generated synthetic datasets. For the evaluation in terms of the SVM model, we use the LibSVM package to build SVM classifiers on the synthetic datasets. We use the same set of parameters of rpart and LibSVM respectively in evaluating the above non-interactive approaches.

For all the experiments, we vary $\epsilon$ from 0.05 to 1.0. Similar to the experiment settings of prior work [17, 30], under each privacy budget, we execute 10-fold stratified cross-validation to evaluate the misclassification rate of the above methods. For each train-test pair, we run the target method 10 times. We report the average measurements over the 10 runs and the 10 cross-validation folds. We set the maximum grid pool size to be 200,000. The implementation and experiments of PrivPfC were done in Python 2.7 and all experiments were conducted on an Intel Core i7-3770 3.40GHz PC with 16GB memory.



Dataset  #Dim  #Numerical  #Categorical  #Records  Classification Task
Adult    15    6           8             45,222    Determine whether a person makes over 50K a year.
Bank     21    10          10            41,188    Determine whether the client subscribed to a term deposit.
US       47    15          31            39,186    Determine whether a person makes over 50K a year.
BR       43    14          28            57,333    Determine whether a person makes over 300 per month.

Table 1: Dataset characteristics



Methods           Description

Non-Interactive
PrivPfC           Our proposed method.
PrivPfC-SelNF     Our proposed method with noise free feature selection and histogram selection.
PrivPfC-FSNF      Our proposed method with noise free feature selection.
DiffGen [30]      Private data release for classification via recursive partitioning.
DiffGen-NF        Noise free DiffGen.
PrivBayes [45]    Private data release via Bayes network.
PrivBayes-NF      Noise free PrivBayes.
PPH [39]          Private data release for classification by projection and perturbation.

Interactive
DiffPC-4.5 [17]   Privately construct C4.5 decision tree classifier.
PrivGene [46]     Private model fitting based on genetic algorithms.
PrivateERM [7]    Private classifier construction based on empirical risk minimization.

Table 2: Summary of differentially private classification methods
Figure 2: Comparison of PrivPfC, DiffGen, PrivBayes, PPH and DiffPC-4.5 by decision tree classification. x-axis: privacy budget $\epsilon$ in log-scale, y-axis: misclassification rate in log-scale.


4.2 Comparison against Competitors 

Comparison on Decision Tree. Five approaches are involved: PrivPfC, DiffGen, PrivBayes, PPH and DiffPC-4.5. Figure 2 reports their average misclassification rates and the corresponding standard deviations. Clearly, PrivPfC has the best performance, followed by DiffGen, PPH, and DiffPC-4.5. PrivBayes is the poorest in most cases. The performance of PrivPfC is also the most robust, as can be seen from the fact that the standard deviation of its misclassification rates is the lowest.


Comparison on SVM. We compare 6 approaches: PrivPfC, DiffGen, PrivBayes, PPH, PrivGene, and PrivateERM. Figure 3 reports the experimental results. Once again, PrivPfC has the best performance, followed by DiffGen, PrivGene, PPH, PrivateERM and PrivBayes.

Figure 3: Comparison of PrivPfC, DiffGen, PrivBayes, PPH, PrivGene and PrivateERM by SVM classification. x-axis: privacy budget $\epsilon$ in log-scale, y-axis: misclassification rate in log-scale.

Effectiveness of Private Feature Selection. In FigureSl we eval¬ 
uate our private feature selection method on the US dataset under 
privacy budget 0.1. We create a variant of PrivPfC, called PrivPfC-FSNF,
in which the feature selection step of PrivPfC is noise-free
and all the privacy budget is used in the remaining steps. PPH is included
in the comparison since it also has a private feature selection
step. We create variants of each of the remaining competitors by adding
our proposed feature selection method as a preprocessing step, which
uses 30% of the total privacy budget.

From Figure 131 we can see that our PrivPfC algorithm has close 
performance to its counterpart (PrivPfC-FSNF). This justifies that 
fact that the set of attributes PrivPfC selects for grid partition is 
almost as good as those selected by PrivPfC-FSNF and the effec¬ 
tiveness of PrivPfC mainly comes from the private histogram se¬ 
lection. We can also see that although PrivBayes, DiffPC-4.5 and 
PrivateERM’s performances are improved significantly by doing 
our private feature selection step, they are still outperformed by 
PrivPfC. 

4.3 Analyses of Sources of Errors 

PrivPfC distributes the privacy budget among three steps, feature 
selection, grid selection and perturbation, in a 30%-30%-40% way. 
When feature selection is not needed, the privacy budget is divided 
between grid selection and perturbation in a ratio of 3:4. While 


PrivGene and PrivateERM by SVM classiflcation. x-axis: privacy 


these ratios are somewhat arbitrary, we have experimentally eval¬ 
uated other ratios, allocating between 20% and 60% to each step. 
We have found that the differences among different budget alloca¬ 
tions are minor, so long as the last step receives at least 30% of the 
privacy budget. Even with the worst allocation, which gives 20% 
to the last step, PrivPfC still clearly outperforms competing meth¬ 
ods. We also consider a variant of PrivPfC, called PrivPfC-SelNF, 
in which the feature selection step and histogram selection step are 
noise free and all the privacy budget is used in the histogram per¬ 
turbation step. PrivPfC-SeINF is not private; it shows the best one 
can hope to achieve by optimizing the division among steps. 
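
The allocation described above is simple arithmetic, and the following minimal sketch makes it concrete. The function name `split_budget` and the step labels are illustrative assumptions and do not come from our implementation; the only facts taken from the text are the 30%-30%-40% split and the 3:4 ratio.

```python
def split_budget(epsilon, weights):
    """Divide a total privacy budget among sequential steps in proportion to weights."""
    total = sum(weights.values())
    return {step: epsilon * w / total for step, w in weights.items()}

# Three steps: 30% / 30% / 40% of the total budget.
with_fs = split_budget(
    1.0, {"feature_selection": 3, "grid_selection": 3, "perturbation": 4}
)

# No feature selection: grid selection and perturbation share the budget 3:4.
without_fs = split_budget(1.0, {"grid_selection": 3, "perturbation": 4})
```

Because the steps are composed sequentially, the per-step budgets must add up to the total ε, which the proportional split guarantees.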

We have seen that PrivPfC outperforms the other non-interactive 
methods such as DiffGen and PrivBayes. The key difference in 
PrivPfC is that we choose the grid g holistically, instead of arriving 
at the final grid through a series of decisions. For example, 
DiffGen iteratively chooses the attributes and ways to partition them, 
and PrivBayes iteratively builds a Bayesian network. There are two 
reasons why such an iterative approach does not perform well. The 
first is that the decisions made in each iteration may be sub-optimal 
because of the perturbation necessary for satisfying differential 
privacy. The second is that even if the decision made in each 
iteration is locally optimal, the combination of these decisions may 
not be globally optimal. To see to what extent the latter factor 
affects accuracy, we consider noise-free variants of the two methods, 
DiffGen-NF and PrivBayes-NF. In these variants the decisions in each 
iteration as well as the publishing of counts in the end are performed 
without any perturbation. They represent DiffGen and PrivBayes when 
the privacy budget ε goes to ∞. 
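
To make the limiting case concrete, the sketch below perturbs a list of counts and shows that the noise vanishes as ε grows. It assumes Laplace noise and a count sensitivity of 1 (neighboring datasets differing in one record); both are standard choices adopted here for illustration, and the function names are hypothetical rather than taken from any of the compared systems.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential variates."""
    if scale == 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def perturb_counts(counts, epsilon, rng, sensitivity=1.0):
    """Add Laplace noise of scale sensitivity/epsilon to each count.

    epsilon = float("inf") yields scale 0, i.e. the noise-free variant.
    Sensitivity 1 is an assumption (one record added or removed).
    """
    scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]

rng = random.Random(0)
counts = [5, 0, 12]
noisy = perturb_counts(counts, epsilon=0.5, rng=rng)
noise_free = perturb_counts(counts, epsilon=float("inf"), rng=rng)
```

With an infinite budget the published counts equal the true counts, so any remaining error in a noise-free variant is due to the structure of the method rather than to perturbation.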
Figure 4: Effectiveness of Private Feature Selection on US dataset with ε = 0.1. y-axis: misclassification rate. 

Figure 5: Analyses of PrivPfC, DiffGen and PrivBayes by decision tree classification. x-axis: privacy budget ε in log-scale, y-axis: 
misclassification rate in log-scale. 


Figure 5 and Figure 6 report the experimental results of 
comparing these methods, using Decision Tree and SVM, respectively. We 
first observe that while PrivPfC-SelNF indeed outperforms PrivPfC, 
the difference is very small, especially for larger ε values in the 
range. In fact, on the Adult, US, and BR datasets, the difference is 
barely noticeable when ε > 0.1. This suggests that little 
improvement can be gained by further optimizing the division of the 
privacy budget or dataset between determining the grid g and 
publishing the noisy histogram. 


We also observe that the non-private noise-free version of PrivBayes 
still performs poorly; in fact, it performs significantly worse than 
the private PrivPfC. This suggests that the iterative Bayesian network 
construction approach is not suitable for the purpose of building 
accurate classifiers. This is perhaps due in large part to the fact that 
it was not originally designed to optimize for classification. 

The non-private DiffGen-NF performs similarly to PrivPfC and 
PrivPfC-SelNF on the Adult and US datasets. On the Bank dataset, 
it is outperformed by PrivPfC and PrivPfC-SelNF when ε > 0.15. 
On the BR dataset, DiffGen-NF performs significantly worse than 
PrivPfC and PrivPfC-SelNF. This suggests that the inherent 
iterative structure of DiffGen is suboptimal, even without considering 
the effect of perturbation. 

Figure 6: Analyses of PrivPfC, DiffGen and PrivBayes by SVM classification. x-axis: privacy budget ε in log-scale, y-axis: 
misclassification rate in log-scale. Panels: (a) Adult, (b) Bank, (c) US, (d) BR. Legend: Majority, NoiseFree, PrivPfC-SVM, 
PrivPfC-SelNF-SVM, DiffGen-SVM, DiffGen-NF-SVM, PrivBayes-SVM, PrivBayes-NF-SVM. 

5. CONCLUSION 

In this paper, we have introduced PrivPfC, a novel framework 
for publishing data for classification under differential privacy. As 
a core part of PrivPfC, we have introduced a novel quality 
function that enables the selection of a good “grid” for publishing 
noisy histograms. We have also introduced a new technique for privately 
selecting the most relevant features for classification, which enables 
PrivPfC to scale to higher-dimensional datasets. We have conducted 
extensive experiments on four real datasets, and the results show 
that our approach greatly outperforms several other state-of-the-art 
methods for private data publishing as well as private classification. 
7. APPENDIX 

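Proof of Lemma 4. Without loss of generality, we assume the cell 
$o_{11}$ is changed by 1. We use $c_i$ to denote the number of tuples with 
$R = r_i$, $i = 1, 2$. 

\begin{align*}
\Delta &= \left| corr(f, D') - corr(f, D) \right| \\
&= \Bigg| \left| o_{11} + 1 - \frac{(o_{11} + o_{12} + 1)(c_1 + 1)}{c_1 + c_2 + 1} \right|
        - \left| o_{11} - \frac{(o_{11} + o_{12})\, c_1}{c_1 + c_2} \right| \\
&\qquad + \sum_{i \neq 1} \left( \left| o_{i1} - \frac{(o_{i1} + o_{i2})(c_1 + 1)}{c_1 + c_2 + 1} \right|
        - \left| o_{i1} - \frac{(o_{i1} + o_{i2})\, c_1}{c_1 + c_2} \right| \right) \Bigg| \\
&\le \left| 1 - \frac{(o_{11} + o_{12} + 1)(c_1 + 1)}{c_1 + c_2 + 1} + \frac{(o_{11} + o_{12})\, c_1}{c_1 + c_2} \right| \\
&\qquad + \sum_{i \neq 1} \left| \frac{(o_{i1} + o_{i2})(c_1 + 1)}{c_1 + c_2 + 1} - \frac{(o_{i1} + o_{i2})\, c_1}{c_1 + c_2} \right| \\
&= \frac{(c_1 + c_2)(c_1 + c_2 + 1) - (o_{11} + o_{12})\, c_2 - (c_1 + 1)(c_1 + c_2)}{(c_1 + c_2)(c_1 + c_2 + 1)} \\
&\qquad + \frac{c_2 \left( (c_1 - o_{11}) + (c_2 - o_{12}) \right)}{(c_1 + c_2)(c_1 + c_2 + 1)} \\
&= \frac{2 c_2 \left( (c_1 - o_{11}) + (c_2 - o_{12}) \right)}{(c_1 + c_2)(c_1 + c_2 + 1)} \\
&< 2 .
\end{align*}

The inequality follows from the triangle inequality applied term by 
term. The numerator of the first fraction simplifies to 
$c_2((c_1 - o_{11}) + (c_2 - o_{12}))$, which gives the combined 
fraction. The final bound holds because $c_2 \le c_1 + c_2$ and 
$(c_1 - o_{11}) + (c_2 - o_{12}) < c_1 + c_2 + 1$. 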
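
As an informal sanity check of the bound, the sketch below evaluates the change on randomly generated two-column contingency tables. The function `corr_first_column` is an assumption read off the terms that appear in the proof (the sum over attribute values of |o_i1 - (o_i1 + o_i2) c_1 / (c_1 + c_2)|); the definition of corr given with the lemma takes precedence over this stand-in, and a random search is of course not a substitute for the proof.

```python
import random

def corr_first_column(table):
    """Sum over rows i of |o_i1 - (o_i1 + o_i2) * c_1 / (c_1 + c_2)|.

    `table` is a list of [o_i1, o_i2] rows. This form is an assumption
    inferred from the terms of the proof, not a quoted definition.
    """
    c1 = sum(row[0] for row in table)
    c2 = sum(row[1] for row in table)
    n = c1 + c2
    return sum(abs(o1 - (o1 + o2) * c1 / n) for o1, o2 in table)

def change_and_bound(table):
    """Return (Delta, closed-form bound) when cell o_11 is increased by 1."""
    c1 = sum(row[0] for row in table)
    c2 = sum(row[1] for row in table)
    o11, o12 = table[0]
    neighbor = [[o11 + 1, o12]] + [list(row) for row in table[1:]]
    delta = abs(corr_first_column(neighbor) - corr_first_column(table))
    bound = 2 * c2 * ((c1 - o11) + (c2 - o12)) / ((c1 + c2) * (c1 + c2 + 1))
    return delta, bound

rng = random.Random(0)
worst = 0.0
for _ in range(2000):
    rows = rng.randint(2, 5)
    table = [[rng.randint(0, 20), rng.randint(0, 20)] for _ in range(rows)]
    if sum(map(sum, table)) == 0:
        continue  # skip the empty dataset, where c_1 + c_2 = 0
    delta, bound = change_and_bound(table)
    # Delta never exceeds the closed form, and the closed form stays below 2.
    assert delta <= bound + 1e-9 and bound < 2
    worst = max(worst, delta)
```

For the table [[1, 0], [0, 1]], both the change and the closed-form expression evaluate to 1/3, so the triangle-inequality step can hold with equality.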